Principal Component Analysis (PCA)¶
1. What is PCA?¶
PCA is a dimensionality reduction technique that transforms a large set of variables (features) into a smaller one that still contains most of the information (variance) in the large set.
- It does this by projecting the data onto a new set of orthogonal axes (principal components), ordered by how much variance they explain.
- PCA is unsupervised: it ignores the target variable y.
2. Why Use PCA?¶
| Reason | Explanation |
|---|---|
| High Dimensionality | Too many input features (columns) lead to the "curse of dimensionality" — data becomes sparse, slow, and prone to overfitting. |
| Redundancy | Many features are correlated (e.g., height and arm length); PCA removes multicollinearity. |
| Noise Reduction | PCA filters out noise by keeping only the most informative components. |
| Visualization | Helps visualize high-dimensional data in 2D or 3D (e.g., PC1 vs PC2). |
| Preprocessing | Makes downstream ML tasks like clustering or classification more efficient. |
3. How Does PCA Work? (Step-by-Step)¶
Standardize the Data
- Mean = 0, Variance = 1 for all features
- PCA is sensitive to feature scales.
Compute the Covariance Matrix
- Measures how features vary with each other.
Calculate Eigenvalues and Eigenvectors
- Eigenvectors → Principal Components (PCs)
- Eigenvalues → Importance (variance explained) by each PC
Sort and Select Top-k Components
- Sort PCs by eigenvalues in descending order
- Choose the top k components based on the desired explained variance (e.g., 95%)
Project Original Data onto New k-Dimensional Space
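The steps above can be sketched directly in NumPy (a minimal illustration on synthetic data; the array X and the choice k=2 are placeholders, not part of the dataset used later):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # toy data: 100 samples, 3 features

# 1. Standardize: mean 0, variance 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort components by eigenvalue, descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top-k components
k = 2
X_pca = X_std @ eigvecs[:, :k]

# Fraction of variance explained by each principal component
explained = eigvals / eigvals.sum()
```

In practice sklearn's PCA does all of this (via SVD) in one call; the sketch just mirrors the five steps.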
4. Important Pointers¶
| Concept | Detail |
|---|---|
| Variance Explained | You can retain 95–99% variance using fewer components. |
| PCs are Orthogonal | No correlation between PCs. |
| Linear Method | PCA only captures linear patterns (no curves or complex boundaries). |
| Unsupervised | It does not use y (target). PCA is purely based on X. |
| Lossy Transformation | Original features can't be perfectly reconstructed from PCs. |
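The lossy nature of the transformation is easy to demonstrate with scikit-learn's inverse_transform (a sketch on synthetic data; the shapes and component count are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # toy data: 200 samples, 5 features

# Keep only 2 of 5 components, then map back to the original space
pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Mean squared reconstruction error is > 0:
# the variance in the discarded components is lost for good
error = np.mean((X - X_reconstructed) ** 2)
```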
5. Advantages of PCA¶
- Reduces Overfitting – Fewer redundant features
- Faster Computation – Smaller data = quicker training
- Better Generalization – Removes noise
- Visual Insights – PCA helps explore patterns in 2D plots
- Removes Multicollinearity – Helpful before regression
6. Disadvantages of PCA¶
- Loss of Interpretability – You lose original feature meaning (e.g., PC1 = 0.4X1 + 0.6X2...)
- Only Linear – Cannot capture non-linear relationships
- Scaling Required – Sensitive to feature magnitudes
- Doesn’t Consider Target – Might reduce features important for prediction
- Not Good for Sparse Data – In text mining (like TF-IDF), PCA may not preserve important word features
7. Corner Cases / Gotchas¶
| Case | Explanation |
|---|---|
| Features on Different Scales | PCA performs poorly without standardization — always scale features first. |
| Too Few Samples | If samples < features (e.g., gene expression data), PCA may overfit. |
| Missing Values | PCA can’t handle missing values; impute or drop them first. |
| Categorical Variables | PCA only works with numeric data. Encode categoricals appropriately. |
| Highly Non-linear Data | Use Kernel PCA or t-SNE instead for curved manifolds. |
| Applying PCA After Train-Test Split | Always fit PCA on X_train, then transform X_test; otherwise you'll leak information from the test data. |
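The last gotcha can be sketched as follows (synthetic data; the shapes are arbitrary). Fit the scaler and PCA on the training split only, then reuse the fitted objects on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # toy data

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit scaler and PCA on the training data ONLY...
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# ...then reuse the fitted objects on the test data (no refitting)
X_test_pca = pca.transform(scaler.transform(X_test))
```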
8. When to Use PCA? (Ideal Scenarios)¶
| Scenario | Use PCA? | Notes |
|---|---|---|
| Too Many Features | Yes | e.g., 100+ columns in CSV |
| Multicollinearity Present | Yes | Good before linear regression |
| Noise in Data | Yes | PCA can filter noise |
| Data Visualization | Yes | Use top 2–3 PCs for plots |
| Sparse or Text Data | No | Use TruncatedSVD or LSA instead |
| You Need Interpretability | No | Use feature selection instead |
| Target is Important | No | Use supervised methods like LDA (Linear Discriminant Analysis) |
9. Choosing Number of Components (k)¶
Use the cumulative explained variance ratio:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# Choose the smallest k whose cumulative variance reaches 95%
k = int(np.argmax(cumulative_variance >= 0.95)) + 1
Plot cumulative_variance against the number of components to find the elbow point.
10. Best Practices¶
- Standardize the features (e.g., StandardScaler)
- Use PCA only on numeric features
- Retain 95–99% variance (tunable)
- Always apply PCA after train-test split, not before
- Keep a copy of PCA object to transform future data
11. Alternative Techniques¶
| Method | Use When |
|---|---|
| LDA | You want dimensionality + classification (uses target y) |
| t-SNE / UMAP | For non-linear visualization |
| Feature Selection | When interpretability is critical |
| Autoencoders | For non-linear compression using neural networks |
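For sparse inputs such as TF-IDF matrices, TruncatedSVD is the usual drop-in replacement, since it works on sparse data directly without centering (a minimal sketch; the matrix size and density here are made up):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A sparse matrix, like TF-IDF output; plain PCA would need to
# center it, which destroys sparsity
X_sparse = sparse_random(100, 500, density=0.01, random_state=0)

# TruncatedSVD accepts sparse input as-is
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse)
```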
# PCA Implementation
# Let's do the necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('data.csv')
df.head()
| number_people | date | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2015-08-14 17:00:11-07:00 | 61211 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 1 | 45 | 2015-08-14 17:20:14-07:00 | 62414 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 2 | 40 | 2015-08-14 17:30:15-07:00 | 63015 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 3 | 44 | 2015-08-14 17:40:16-07:00 | 63616 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 4 | 45 | 2015-08-14 17:50:17-07:00 | 64217 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
df.columns
Index(['number_people', 'date', 'timestamp', 'day_of_week', 'is_weekend',
'is_holiday', 'temperature', 'is_start_of_semester',
'is_during_semester', 'month', 'hour'],
dtype='object')
# Problem Statement:
# Crowdedness at the Campus Gym using PCA
# Data Description for columns: 'number_people', 'date', 'timestamp', 'day_of_week', 'is_weekend', 'is_holiday', 'temperature', 'is_start_of_semester', 'is_during_semester', 'month', 'hour'
# number_people: Number of students present in the gym at a given time
# date: Date of the observation in YYYY-MM-DD format
# timestamp: Time of the observation
# day_of_week: Day of the week (0=Monday, 6=Sunday)
# is_weekend: Boolean indicating if the observation is on a weekend (Saturday or Sunday)
# is_holiday: Boolean indicating if the observation is on a holiday
# temperature: Temperature in degrees Fahrenheit at the time of observation (converted to Celsius below)
# is_start_of_semester: Boolean indicating if the observation is during the start of a semester
# is_during_semester: Boolean indicating if the observation is during an active semester
# month: Month of the observation (1=January, 12=December)
# hour: Hour of the day (0-23) when the observation was made
df.describe()
| number_people | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 |
| mean | 29.072543 | 45799.437958 | 2.982504 | 0.282870 | 0.002573 | 58.557108 | 0.078831 | 0.660218 | 7.439824 | 12.236460 |
| std | 22.689026 | 24211.275891 | 1.996825 | 0.450398 | 0.050660 | 6.316396 | 0.269476 | 0.473639 | 3.445069 | 6.717631 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.140000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 9.000000 | 26624.000000 | 1.000000 | 0.000000 | 0.000000 | 55.000000 | 0.000000 | 0.000000 | 5.000000 | 7.000000 |
| 50% | 28.000000 | 46522.500000 | 3.000000 | 0.000000 | 0.000000 | 58.340000 | 0.000000 | 1.000000 | 8.000000 | 12.000000 |
| 75% | 43.000000 | 66612.000000 | 5.000000 | 1.000000 | 0.000000 | 62.280000 | 0.000000 | 1.000000 | 10.000000 | 18.000000 |
| max | 145.000000 | 86399.000000 | 6.000000 | 1.000000 | 1.000000 | 87.170000 | 1.000000 | 1.000000 | 12.000000 | 23.000000 |
df.shape
(62184, 11)
df.corr(numeric_only=True)
| number_people | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|
| number_people | 1.000000 | 0.550218 | -0.162062 | -0.173958 | -0.048249 | 0.373327 | 0.182683 | 0.335350 | -0.097854 | 0.552049 |
| timestamp | 0.550218 | 1.000000 | -0.001793 | -0.000509 | 0.002851 | 0.184849 | 0.009551 | 0.044676 | -0.023221 | 0.999077 |
| day_of_week | -0.162062 | -0.001793 | 1.000000 | 0.791338 | -0.075862 | 0.011169 | -0.011782 | -0.004824 | 0.015559 | -0.001914 |
| is_weekend | -0.173958 | -0.000509 | 0.791338 | 1.000000 | -0.031899 | 0.020673 | -0.016646 | -0.036127 | 0.008462 | -0.000517 |
| is_holiday | -0.048249 | 0.002851 | -0.075862 | -0.031899 | 1.000000 | -0.088527 | -0.014858 | -0.070798 | -0.094942 | 0.002843 |
| temperature | 0.373327 | 0.184849 | 0.011169 | 0.020673 | -0.088527 | 1.000000 | 0.093242 | 0.152476 | 0.063125 | 0.185121 |
| is_start_of_semester | 0.182683 | 0.009551 | -0.011782 | -0.016646 | -0.014858 | 0.093242 | 1.000000 | 0.209862 | -0.137160 | 0.010091 |
| is_during_semester | 0.335350 | 0.044676 | -0.004824 | -0.036127 | -0.070798 | 0.152476 | 0.209862 | 1.000000 | 0.096556 | 0.045581 |
| month | -0.097854 | -0.023221 | 0.015559 | 0.008462 | -0.094942 | 0.063125 | -0.137160 | 0.096556 | 1.000000 | -0.023624 |
| hour | 0.552049 | 0.999077 | -0.001914 | -0.000517 | 0.002843 | 0.185121 | 0.010091 | 0.045581 | -0.023624 | 1.000000 |
# The temperature given here is in Fahrenheit. We will convert it to Celsius using the formula Celsius = (Fahrenheit - 32) * (5/9)
# A vectorized pandas operation replaces the list/map/Series round trip
df['temperature'] = (df['temperature'] - 32) * (5 / 9)
df['temperature']
df.head()
# Thus we have converted the temperature column from Fahrenheit to degrees Celsius.
| number_people | date | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2015-08-14 17:00:11-07:00 | 61211 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 1 | 45 | 2015-08-14 17:20:14-07:00 | 62414 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 2 | 40 | 2015-08-14 17:30:15-07:00 | 63015 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 3 | 44 | 2015-08-14 17:40:16-07:00 | 63616 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 4 | 45 | 2015-08-14 17:50:17-07:00 | 64217 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
X = df.iloc[:,1:] # all rows, all the features and no labels
y = df.iloc[:, 0] # all rows, label only
# Problem Statement:
# Crowdedness at the Campus Gym using PCA
# y - number_people: Number of students present in the gym at a given time
# Therefore, we will reduce the features with PCA and then predict the number of people in the gym.
X.head()
| date | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-14 17:00:11-07:00 | 61211 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 1 | 2015-08-14 17:20:14-07:00 | 62414 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 2 | 2015-08-14 17:30:15-07:00 | 63015 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 3 | 2015-08-14 17:40:16-07:00 | 63616 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 4 | 2015-08-14 17:50:17-07:00 | 64217 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
y.head()
0    37
1    45
2    40
3    44
4    45
Name: number_people, dtype: int64
correlation = df.corr(numeric_only=True)
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='viridis')
plt.title('Correlation between different features')
plt.show()
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   date                  62184 non-null  object
 1   timestamp             62184 non-null  int64
 2   day_of_week           62184 non-null  int64
 3   is_weekend            62184 non-null  int64
 4   is_holiday            62184 non-null  int64
 5   temperature           62184 non-null  float64
 6   is_start_of_semester  62184 non-null  int64
 7   is_during_semester    62184 non-null  int64
 8   month                 62184 non-null  int64
 9   hour                  62184 non-null  int64
dtypes: float64(1), int64(8), object(1)
memory usage: 4.7+ MB
X.drop('date',axis=1,inplace=True)
X.columns
Index(['timestamp', 'day_of_week', 'is_weekend', 'is_holiday', 'temperature',
'is_start_of_semester', 'is_during_semester', 'month', 'hour'],
dtype='object')
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   timestamp             62184 non-null  int64
 1   day_of_week           62184 non-null  int64
 2   is_weekend            62184 non-null  int64
 3   is_holiday            62184 non-null  int64
 4   temperature           62184 non-null  float64
 5   is_start_of_semester  62184 non-null  int64
 6   is_during_semester    62184 non-null  int64
 7   month                 62184 non-null  int64
 8   hour                  62184 non-null  int64
dtypes: float64(1), int64(8)
memory usage: 4.3 MB
# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shape of the training and testing sets
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)
Training set shape: (49747, 9)
Testing set shape: (12437, 9)
Training labels shape: (49747,)
Testing labels shape: (12437,)
# Apply StandardScaler to the features
features = X_train.columns.tolist()
# features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[features] = sc.fit_transform(X_train[features])
X_test[features] = sc.transform(X_test[features])
X_train.shape
(49747, 9)
# PCA
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
# class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)
from sklearn.decomposition import PCA
pca = PCA()
# Fit PCA on the training data
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Explained Variance Ratio: It returns the percentage of variance explained by each of the selected components.
explained_variance = pca.explained_variance_ratio_
print("The principal components explain the following variance:")
for i, var in enumerate(explained_variance):
print(f"Principal Component {i+1}: {var:.4f}")
The principal components explain the following variance:
Principal Component 1: 0.2305
Principal Component 2: 0.2003
Principal Component 3: 0.1456
Principal Component 4: 0.1287
Principal Component 5: 0.1019
Principal Component 6: 0.0928
Principal Component 7: 0.0773
Principal Component 8: 0.0229
Principal Component 9: 0.0001
# Let's plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio by Principal Components')
plt.xticks(range(1, len(explained_variance) + 1))
plt.show()
# Let's do the modeling using a Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
# Fit the model on the training data
model.fit(X_train_pca, y_train)
# Predict on the train and test data
y_train_pred = model.predict(X_train_pca)
y_test_pred = model.predict(X_test_pca)
# Evaluate the model - r2, RMSE
from sklearn.metrics import r2_score, mean_squared_error
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Training R^2: {train_r2:.4f}")
print(f"Testing R^2: {test_r2:.4f}")
print(f"Training RMSE: {train_rmse:.4f}")
print(f"Testing RMSE: {test_rmse:.4f}")
Training R^2: 0.9861
Testing R^2: 0.9007
Training RMSE: 2.6718
Testing RMSE: 7.1626
# Install plotly
# %pip install --upgrade pip -q
# %pip install plotly -q
print(explained_variance,"\n\n")
print(np.cumsum(explained_variance))
[2.30519231e-01 2.00263856e-01 1.45583261e-01 1.28687653e-01
 1.01905642e-01 9.27520152e-02 7.73183498e-02 2.28677834e-02
 1.02208707e-04]

[0.23051923 0.43078309 0.57636635 0.705054   0.80695964 0.89971166
 0.97703001 0.99989779 1.        ]
# Let's plot the cumulative explained variance using Plotly
import plotly.express as px
cumulative_variance = np.cumsum(explained_variance)
fig = px.line(x=range(1, len(cumulative_variance) + 1), y=cumulative_variance, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance by Principal Components')
fig.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig.show()
# This plot shows how much variance is explained as we add more principal components, helping us decide how many components to keep for further analysis or modeling.
# The cumulative explained variance plot is useful for determining the number of principal components to retain in PCA
# based on the desired level of explained variance.
# Let's apply PCA with 2 components
pca_2 = PCA(n_components=2)
X_train_pca_2 = pca_2.fit_transform(X_train)
X_test_pca_2 = pca_2.transform(X_test)
# Now we can visualize the data in 2D using the first two principal components
plt.figure(figsize=(10, 6))
plt.scatter(X_train_pca_2[:, 0], X_train_pca_2[:, 1], c=y_train, cmap='viridis', edgecolor='k', s=50)
plt.colorbar(label='Number of People')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: 2D Visualization of Training Data')
plt.show()
# The scatter plot shows the distribution of the training data in the new PCA space, where each point represents a sample, and the color indicates the number of people present in the gym.
# Let's visualize the explained variance ratio for the first two principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 3), explained_variance[:2], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Two Principal Components')
plt.xticks(range(1, 3))
plt.show()
# The bar plot shows the explained variance ratio for the first two principal components, indicating how much variance each component captures in the data.
# Let's visualize the cumulative explained variance for the first two principal components using Plotly
cumulative_variance_2 = np.cumsum(explained_variance[:2])
fig_2 = px.line(x=range(1, 3), y=cumulative_variance_2, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance for First Two Principal Components')
fig_2.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_2.show()
# The cumulative explained variance plot for the first two principal components shows how much total variance is explained
# when considering both components, helping us understand the overall information captured by these two dimensions.
# Let's apply Random Forest Regressor on the PCA-transformed data with 2 components
model_2 = RandomForestRegressor()
# Fit the model on the training data with 2 PCA components
model_2.fit(X_train_pca_2, y_train)
# Predict on the train and test data with 2 PCA components
y_train_pred_2 = model_2.predict(X_train_pca_2)
y_test_pred_2 = model_2.predict(X_test_pca_2)
# Evaluate the model with 2 PCA components - r2, RMSE
train_r2_2 = r2_score(y_train, y_train_pred_2)
test_r2_2 = r2_score(y_test, y_test_pred_2)
train_rmse_2 = np.sqrt(mean_squared_error(y_train, y_train_pred_2))
test_rmse_2 = np.sqrt(mean_squared_error(y_test, y_test_pred_2))
print(f"Training R^2 with 2 PCA components: {train_r2_2:.4f}")
print(f"Testing R^2 with 2 PCA components: {test_r2_2:.4f}")
print(f"Training RMSE with 2 PCA components: {train_rmse_2:.4f}")
print(f"Testing RMSE with 2 PCA components: {test_rmse_2:.4f}")
Training R^2 with 2 PCA components: 0.9750
Testing R^2 with 2 PCA components: 0.8343
Training RMSE with 2 PCA components: 3.5850
Testing RMSE with 2 PCA components: 9.2530
# Let's apply PCA with 3 components
pca_3 = PCA(n_components=3)
X_train_pca_3 = pca_3.fit_transform(X_train)
X_test_pca_3 = pca_3.transform(X_test)
# Now we can visualize the data in 3D using the first three principal components
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Keep a handle to the scatter so the colorbar reuses it instead of plotting the data a second time
scatter = ax.scatter(X_train_pca_3[:, 0], X_train_pca_3[:, 1], X_train_pca_3[:, 2], c=y_train, cmap='viridis', edgecolor='k', s=50)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('PCA: 3D Visualization of Training Data')
fig.colorbar(scatter, label='Number of People')
plt.show()
# The 3D scatter plot shows the distribution of the training data in the new PCA space, where each point represents a sample, and the color indicates the number of people present in the gym.
# Let's visualize the explained variance ratio for the first three principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 4), explained_variance[:3], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Three Principal Components')
plt.xticks(range(1, 4))
plt.show()
# The bar plot shows the explained variance ratio for the first three principal components, indicating how much variance each component captures in the data.
# Let's visualize the cumulative explained variance for the first three principal components using Plotly
cumulative_variance_3 = np.cumsum(explained_variance[:3])
fig_3 = px.line(x=range(1, 4), y=cumulative_variance_3, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance for First Three Principal Components')
fig_3.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_3.show()
# The cumulative explained variance plot for the first three principal components shows how much total variance is explained
# when considering all three components, helping us understand the overall information captured by these three dimensions.
# Let's apply Random Forest Regressor on the PCA-transformed data with 3 components
model_3 = RandomForestRegressor()
# Fit the model on the training data with 3 PCA components
model_3.fit(X_train_pca_3, y_train)
# Predict on the train and test data with 3 PCA components
y_train_pred_3 = model_3.predict(X_train_pca_3)
y_test_pred_3 = model_3.predict(X_test_pca_3)
# Evaluate the model with 3 PCA components - r2, RMSE
train_r2_3 = r2_score(y_train, y_train_pred_3)
test_r2_3 = r2_score(y_test, y_test_pred_3)
train_rmse_3 = np.sqrt(mean_squared_error(y_train, y_train_pred_3))
test_rmse_3 = np.sqrt(mean_squared_error(y_test, y_test_pred_3))
print(f"Training R^2 with 3 PCA components: {train_r2_3:.4f}")
print(f"Testing R^2 with 3 PCA components: {test_r2_3:.4f}")
print(f"Training RMSE with 3 PCA components: {train_rmse_3:.4f}")
print(f"Testing RMSE with 3 PCA components: {test_rmse_3:.4f}")
Training R^2 with 3 PCA components: 0.9852
Testing R^2 with 3 PCA components: 0.8995
Training RMSE with 3 PCA components: 2.7578
Testing RMSE with 3 PCA components: 7.2088
# Let's apply PCA with 4 components
pca_4 = PCA(n_components=4)
X_train_pca_4 = pca_4.fit_transform(X_train)
X_test_pca_4 = pca_4.transform(X_test)
# Since it is 4D, we won't visualize it.
# Let's plot the explained variance ratio for the first four principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 5), explained_variance[:4], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Four Principal Components')
plt.xticks(range(1, 5))
plt.show()
# Let's visualize the cumulative explained variance for the first four principal components using Plotly
cumulative_variance_4 = np.cumsum(explained_variance[:4])
fig_4 = px.line(x=range(1, 5), y=cumulative_variance_4, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance for First Four Principal Components')
fig_4.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_4.show()
# The cumulative explained variance plot for the first four principal components shows how much total variance is explained
# when considering all four components, helping us understand the overall information captured by these four dimensions.
# Let's apply Random Forest Regressor on the PCA-transformed data with 4 components
model_4 = RandomForestRegressor()
# Fit the model on the training data with 4 PCA components
model_4.fit(X_train_pca_4, y_train)
# Predict on the train and test data with 4 PCA components
y_train_pred_4 = model_4.predict(X_train_pca_4)
y_test_pred_4 = model_4.predict(X_test_pca_4)
# Evaluate the model with 4 PCA components - r2, RMSE
train_r2_4 = r2_score(y_train, y_train_pred_4)
test_r2_4 = r2_score(y_test, y_test_pred_4)
train_rmse_4 = np.sqrt(mean_squared_error(y_train, y_train_pred_4))
test_rmse_4 = np.sqrt(mean_squared_error(y_test, y_test_pred_4))
print(f"Training R^2 with 4 PCA components: {train_r2_4:.4f}")
print(f"Testing R^2 with 4 PCA components: {test_r2_4:.4f}")
print(f"Training RMSE with 4 PCA components: {train_rmse_4:.4f}")
print(f"Testing RMSE with 4 PCA components: {test_rmse_4:.4f}")
Training R^2 with 4 PCA components: 0.9884
Testing R^2 with 4 PCA components: 0.9211
Training RMSE with 4 PCA components: 2.4465
Testing RMSE with 4 PCA components: 6.3857
# Let's apply PCA with 5 components
pca_5 = PCA(n_components=5)
X_train_pca_5 = pca_5.fit_transform(X_train)
X_test_pca_5 = pca_5.transform(X_test)
# Since it is 5D, we won't visualize it.
# Let's plot the explained variance ratio for the first five principal components
plt.figure(figsize=(10, 6))
plt.bar(range(1, 6), explained_variance[:5], alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for First Five Principal Components')
plt.xticks(range(1, 6))
plt.show()
# Let's visualize the cumulative explained variance for the first five principal components using Plotly
cumulative_variance_5 = np.cumsum(explained_variance[:5])
fig_5 = px.line(x=range(1, 6), y=cumulative_variance_5, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance for First Five Principal Components')
fig_5.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_5.show()
# The cumulative explained variance plot for the first five principal components shows how much total variance is explained
# when considering all five components, helping us understand the overall information captured by these five dimensions.
# Let's apply Random Forest Regressor on the PCA-transformed data with 5 components
model_5 = RandomForestRegressor()
# Fit the model on the training data with 5 PCA components
model_5.fit(X_train_pca_5, y_train)
# Predict on the train and test data with 5 PCA components
y_train_pred_5 = model_5.predict(X_train_pca_5)
y_test_pred_5 = model_5.predict(X_test_pca_5)
# Evaluate the model with 5 PCA components - r2, RMSE
train_r2_5 = r2_score(y_train, y_train_pred_5)
test_r2_5 = r2_score(y_test, y_test_pred_5)
train_rmse_5 = np.sqrt(mean_squared_error(y_train, y_train_pred_5))
test_rmse_5 = np.sqrt(mean_squared_error(y_test, y_test_pred_5))
print(f"Training R^2 with 5 PCA components: {train_r2_5:.4f}")
print(f"Testing R^2 with 5 PCA components: {test_r2_5:.4f}")
print(f"Training RMSE with 5 PCA components: {train_rmse_5:.4f}")
print(f"Testing RMSE with 5 PCA components: {test_rmse_5:.4f}")
Training R^2 with 5 PCA components: 0.9882
Testing R^2 with 5 PCA components: 0.9190
Training RMSE with 5 PCA components: 2.4613
Testing RMSE with 5 PCA components: 6.4684
# Let's plot R^2 for train and test data for 2, 3, 4, 5, and 9 components
components = [2, 3, 4, 5, 9]
train_r2_values = [train_r2_2, train_r2_3, train_r2_4, train_r2_5, train_r2]
test_r2_values = [test_r2_2, test_r2_3, test_r2_4, test_r2_5, test_r2]
plt.figure(figsize=(10, 6))
plt.plot(components, train_r2_values, marker='o', label='Training R^2')
plt.plot(components, test_r2_values, marker='o', label='Testing R^2')
plt.xlabel('Number of Principal Components')
plt.ylabel('R^2 Score')
plt.title('R^2 Score for Different Number of Principal Components')
plt.legend()
plt.xticks(components)
plt.grid()
plt.show()
# The line plot shows the R^2 scores for both training and testing data as the number of principal components increases.
# PCA
# class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)
# n_components: int, float or ‘mle’, default=None
# Let's apply PCA to capture 80% of the variance
pca_80 = PCA(n_components=0.80)
X_train_pca_80 = pca_80.fit_transform(X_train)
X_test_pca_80 = pca_80.transform(X_test)
# We won't visualize this projection directly.
# Let's plot the explained variance ratio for the PCA retaining 80% of the variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(pca_80.explained_variance_ratio_) + 1), pca_80.explained_variance_ratio_, alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for PCA with 80% Variance')
plt.xticks(range(1, len(pca_80.explained_variance_ratio_) + 1))
plt.show()
# Let's visualize the cumulative explained variance for the PCA with 80% variance using Plotly
cumulative_variance_80 = np.cumsum(pca_80.explained_variance_ratio_)
fig_80 = px.line(x=range(1, len(cumulative_variance_80) + 1), y=cumulative_variance_80, labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance for PCA with 80% Variance')
fig_80.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_80.show()
# The cumulative explained variance plot for PCA with 80% variance shows how much total variance is explained
# when considering the principal components that capture 80% of the variance, helping us understand
# the overall information captured by these components.
# Let's apply Random Forest Regressor on the PCA-transformed data with 80% variance
model_80 = RandomForestRegressor()
# Fit the model on the training data with PCA components that capture 80% variance
model_80.fit(X_train_pca_80, y_train)
# Predict on the train and test data with PCA components that capture 80% variance
y_train_pred_80 = model_80.predict(X_train_pca_80)
y_test_pred_80 = model_80.predict(X_test_pca_80)
# Evaluate the model with PCA components that capture 80% variance - r2, RMSE
train_r2_80 = r2_score(y_train, y_train_pred_80)
test_r2_80 = r2_score(y_test, y_test_pred_80)
train_rmse_80 = np.sqrt(mean_squared_error(y_train, y_train_pred_80))
test_rmse_80 = np.sqrt(mean_squared_error(y_test, y_test_pred_80))
print(f"Training R^2 with PCA components that capture 80% variance: {train_r2_80:.4f}")
print(f"Testing R^2 with PCA components that capture 80% variance: {test_r2_80:.4f}")
print(f"Training RMSE with PCA components that capture 80% variance: {train_rmse_80:.4f}")
print(f"Testing RMSE with PCA components that capture 80% variance: {test_rmse_80:.4f}")
Training R^2 with PCA components that capture 80% variance: 0.9883
Testing R^2 with PCA components that capture 80% variance: 0.9192
Training RMSE with PCA components that capture 80% variance: 2.4549
Testing RMSE with PCA components that capture 80% variance: 6.4608
t-SNE (t-distributed Stochastic Neighbor Embedding)¶
1. What is t-SNE?¶
t-SNE is a non-linear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton. It is mainly used for visualizing high-dimensional data in two or three dimensions.
Unlike PCA, which preserves global structure and variance, t-SNE focuses on preserving local structure — meaning it aims to keep similar points close together in the low-dimensional space.
2. Why Use t-SNE?¶
In many real-world datasets, the number of features (columns) is high, making direct visualization impossible. t-SNE helps to:
- Reduce high-dimensional data to 2D or 3D for visual exploration.
- Reveal hidden patterns, groupings, or clusters.
- Understand relationships between data points without building a model.
3. Intuition Behind t-SNE¶
Here’s how t-SNE works in simplified terms:
a. High-Dimensional Similarities¶
- t-SNE calculates the probability that two data points are similar based on their distances.
- Nearby points have high probabilities; distant points have low probabilities.
- These probabilities are computed using a Gaussian distribution in the high-dimensional space.
b. Low-Dimensional Similarities¶
- In the target low-dimensional space (usually 2D), t-SNE tries to recreate a similar structure.
- But instead of a Gaussian, it uses a Student's t-distribution (with 1 degree of freedom) to measure pairwise similarities.
c. Minimize Divergence¶
t-SNE minimizes the Kullback–Leibler (KL) divergence between the two probability distributions:
- High-dimensional (input space)
- Low-dimensional (visualization space)
The goal is to map similar points close together and dissimilar points far apart.
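The two affinity kernels and the KL objective can be sketched in NumPy on toy data. This is only an illustration of the quantities involved (the real t-SNE also uses per-point bandwidths set by perplexity, and optimizes the embedding by gradient descent):

```python
import numpy as np

def gaussian_similarity(d_sq, sigma=1.0):
    # High-dimensional affinity: Gaussian kernel on squared distances
    return np.exp(-d_sq / (2 * sigma ** 2))

def student_t_similarity(d_sq):
    # Low-dimensional affinity: Student's t-distribution with 1 degree of freedom
    return 1.0 / (1.0 + d_sq)

def pairwise_sq_dists(X):
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(-1)

def affinity_matrix(X, kernel):
    A = kernel(pairwise_sq_dists(X))
    np.fill_diagonal(A, 0.0)   # a point is not its own neighbor
    return A / A.sum()         # normalize into a probability distribution

# Toy data: two tight pairs of points in 3-D, and a candidate 2-D embedding
X_hi = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]])
X_lo = np.array([[0.0, 0.0], [0.2, 0.0], [8.0, 8.0], [8.3, 8.0]])

P = affinity_matrix(X_hi, gaussian_similarity)   # input-space similarities
Q = affinity_matrix(X_lo, student_t_similarity)  # embedding-space similarities

# KL divergence between the two distributions -- the quantity t-SNE minimizes
mask = P > 0
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(f"KL(P || Q) = {kl:.4f}")
```

A good embedding is one where `Q` matches `P`, driving the KL divergence toward zero.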
4. Key Parameters in t-SNE¶
| Parameter | Description | Tips |
|---|---|---|
| `n_components` | Number of output dimensions (usually 2 or 3) | Use 2D for plots |
| `perplexity` | Balance between local and global structure (roughly: number of neighbors to consider) | Typical range: 5 to 50 |
| `learning_rate` | Step size for optimization | Range: 10 to 1000; too low/high may fail |
| `n_iter` | Number of optimization iterations | At least 250; 1000+ recommended |
| `random_state` | Ensures reproducibility | Use fixed seed like 42 |
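A minimal call wiring these parameters together on synthetic data (the values are illustrative, not tuned):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(42)
X_demo = rng.rand(100, 10)   # 100 samples, 10 features

tsne = TSNE(
    n_components=2,     # 2-D output for plotting
    perplexity=30,      # roughly the number of neighbors considered; must be < n_samples
    learning_rate=200,  # step size; 'auto' is also a good choice in recent scikit-learn
    random_state=42,    # fixed seed for reproducibility
)
X_2d = tsne.fit_transform(X_demo)
print(X_2d.shape)  # (100, 2)
```

Note that recent scikit-learn releases (1.5+) renamed the iteration-count parameter from `n_iter` to `max_iter`; it is omitted above to stay version-agnostic.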
5. Example Use Cases¶
- Visualizing word embeddings (like Word2Vec or GloVe)
- Customer segmentation in marketing
- Clustering of gene expression profiles
- Analyzing MNIST handwritten digits
- Understanding embeddings from deep learning models
6. What Does a t-SNE Plot Show?¶
- Each point in the plot represents a high-dimensional data point (e.g., a row in your dataset).
- Points that appear close together in the plot were similar in the original high-dimensional space.
- If you color points by class label (if available), you'll often see natural clustering or class separation.
7. Example Interpretation¶
Consider a t-SNE plot of handwritten digits:
- Cluster of “3”s appears in one region.
- Cluster of “8”s is nearby but separate.
- Some overlapping may exist if digits are visually similar.
This indicates that the digit embeddings are separable based on their latent features, and classes are locally grouped.
8. Limitations of t-SNE¶
| Limitation | Explanation |
|---|---|
| Not deterministic | Can yield different plots each run (unless random_state is fixed) |
| No inverse transform | You can't reconstruct original data from t-SNE outputs |
| Only visualization | t-SNE is not meant for preprocessing before modeling |
| Computationally intensive | Slow for large datasets |
| Misleading global structure | t-SNE preserves local structure well, not global distances |
9. Best Practices¶
- Standardize the data before applying t-SNE.
- Try different perplexity values to find stable patterns.
- Use color coding by labels to understand clusters better.
- Avoid using t-SNE for preprocessing before model training.
- Don’t over-interpret distances between far-apart clusters.
10. When to Use t-SNE?¶
| Scenario | Use t-SNE? |
|---|---|
| You want to visualize high-dimensional data | Yes |
| You need to find natural clusters | Yes |
| You need a model-ready reduced dataset | No (use PCA instead) |
| You want to reverse-transform to original space | No (not supported) |
| Your dataset has >10,000 samples | Use with caution; may be slow |
11. Alternatives to t-SNE¶
| Method | When to Use |
|---|---|
| PCA | When variance explanation and interpretability are important |
| UMAP | Faster, more scalable than t-SNE; preserves more global structure |
| Autoencoders | For non-linear dimensionality reduction with reconstruction |
| ISOMAP | For manifold learning where global geometry matters |
X.head()
| timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 61211 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 1 | 62414 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 2 | 63015 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 3 | 63616 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
| 4 | 64217 | 4 | 0 | 0 | 22.088889 | 0 | 0 | 8 | 17 |
y.head()
0    37
1    45
2    40
3    44
4    45
Name: number_people, dtype: int64
# Scaling the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Scaling the features is important for PCA as it is sensitive to the variances of the features.
# Apply t-SNE for visualization on X_scaled
from sklearn.manifold import TSNE
# Fit and transform the scaled data using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
# Visualize the t-SNE results
plt.figure(figsize=(10, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.colorbar(label='Number of People')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Visualization of the Gym Crowdedness Data')
plt.show()
# The t-SNE plot shows the distribution of the data in a 2D space,
# where each point represents a sample, and the color indicates the number of people present in the gym.
# t-SNE is particularly useful for visualizing high-dimensional data in a lower-dimensional space
# while preserving the local structure of the data.
The t-SNE plot provides a meaningful 2D representation of the gym crowdedness dataset. The analysis reveals the following insights:
The high-dimensional feature space was standardized and then reduced to 2D with t-SNE, enabling visual exploration of complex patterns.
Color intensity indicates the number of people present in the gym.
- Darker points represent lower crowd levels.
- Brighter yellow/green points represent higher crowd levels.
Cluster patterns are clearly visible, indicating that certain combinations of features (e.g., time, day, conditions) correspond to distinct crowd behavior profiles.
Transitional regions suggest gradual changes in crowdedness, likely due to overlapping behavioral patterns or moderate usage hours.
This visualization can support:
- Identifying peak vs. off-peak periods
- Segmenting gym users by behavioral trends
- Strategic resource planning (staffing, facility usage) based on crowd clusters
Overall, the t-SNE visualization provides actionable insight into gym usage dynamics and highlights natural groupings in the data that can inform operational decisions.
Key Takeaways:¶
Each dot in the plot is a snapshot (e.g., a specific time or day) when people were present in the gym.
The color of the dot shows how crowded the gym was at that moment:
- Darker purple means fewer people.
- Brighter yellow means more people.
The dots are grouped based on similar usage patterns.
- Areas where dots are close together mean similar crowd levels and behavior.
- Spread-out areas indicate different or unusual crowd levels.
Clusters of similar colors show that the gym tends to have consistent crowd levels at certain times or under certain conditions (for example, weekday evenings might form a high-crowd cluster).
Linear Discriminant Analysis (LDA)¶
1. What is LDA?¶
LDA (Linear Discriminant Analysis) is a supervised dimensionality reduction technique used primarily for classification tasks.
- Unlike PCA, which maximizes variance without considering class labels, LDA seeks to maximize class separability.
- LDA projects data onto a lower-dimensional space where the classes are most distinguishable.
2. Why Use LDA?¶
| Reason | Explanation |
|---|---|
| Classification-Focused | LDA improves separation between classes using label (y) information. |
| Reduce Dimensionality | Just like PCA, it reduces input features to fewer linear combinations. |
| Improve Model Accuracy | Enhances performance of classifiers by simplifying feature space. |
| Better Visualization | Enables 2D or 3D visual analysis of multi-class problems. |
3. How Does LDA Work? (Step-by-Step)¶
Compute Class-wise Mean Vectors
- Calculate the mean vector for each class in the dataset.
Compute Scatter Matrices
- Within-class scatter matrix (SW): How data points in a class vary among themselves.
- Between-class scatter matrix (SB): How class means vary from the overall mean.
Solve the Generalized Eigenvalue Problem
- Solve the matrix equation to find the eigenvectors and eigenvalues of `inv(SW) * SB`.
Select Linear Discriminants
- Rank eigenvectors by eigenvalues.
- Select top `k` eigenvectors to form the LDA projection matrix.
Project Data
- Multiply original data with the LDA matrix to transform into lower dimensions.
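The steps above can be sketched directly in NumPy. This is a bare-bones illustration on synthetic two-class data; scikit-learn's `LinearDiscriminantAnalysis` handles the numerics far more robustly:

```python
import numpy as np

def lda_fit_transform(X, y, k):
    classes = np.unique(y)
    mean_overall = X.mean(axis=0)
    n_features = X.shape[1]

    SW = np.zeros((n_features, n_features))  # within-class scatter
    SB = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        SW += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_overall).reshape(-1, 1)
        SB += len(Xc) * (diff @ diff.T)

    # Generalized eigenvalue problem for inv(SW) @ SB
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
    order = np.argsort(eigvals.real)[::-1]  # rank eigenvectors by eigenvalue
    W = eigvecs.real[:, order[:k]]          # top-k linear discriminants
    return X @ W                            # project onto the discriminants

# Toy data: two well-separated 3-D classes
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 3), rng.randn(20, 3) + 5])
y = np.array([0] * 20 + [1] * 20)
X_proj = lda_fit_transform(X, y, k=1)  # 2 classes -> at most 1 discriminant
print(X_proj.shape)  # (40, 1)
```

In the projected 1-D space the two class means end up far apart relative to the within-class spread, which is exactly what LDA optimizes for.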
4. Key Differences: LDA vs PCA¶
| Feature | PCA | LDA |
|---|---|---|
| Supervision | Unsupervised (ignores `y`) | Supervised (uses `y`) |
| Goal | Maximize variance | Maximize class separation |
| Components | At most min(n_samples, n_features) | At most (n_classes - 1) |
| Output | Principal Components | Linear Discriminants |
| Assumes Distribution | No assumption | Gaussian class distribution |
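The component caps in the table can be checked directly with scikit-learn on synthetic data: with 4 features and 3 classes, PCA can return up to 4 components, while LDA is capped at `n_classes - 1 = 2`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(42)
X = rng.randn(90, 4)          # 90 samples, 4 features
y = np.repeat([0, 1, 2], 30)  # 3 classes

# PCA ignores y; up to min(n_samples, n_features) components
X_pca = PCA(n_components=4).fit_transform(X)
# LDA uses y; at most n_classes - 1 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape)  # (90, 4)
print(X_lda.shape)  # (90, 2)
```

Requesting `n_components=3` from LDA here raises a `ValueError`, because it cannot exceed `min(n_features, n_classes - 1)`.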
5. Important Pointers¶
| Concept | Details |
|---|---|
| Uses `y` (labels)? | Yes, LDA is supervised. |
| Max Dimensions | Limited to n_classes - 1 components. |
| Gaussian Assumption | Assumes each class is normally distributed with same covariance. |
| Linearity | LDA creates linear decision boundaries. |
| Transformation | Linear projection just like PCA, but more class-aware. |
6. Advantages of LDA¶
- Improves classification by focusing on class separability.
- Reduces dimensionality and noise.
- Produces interpretable projections aligned with class structure.
- Often improves performance of classifiers like logistic regression, SVM, and Naive Bayes.
7. Disadvantages of LDA¶
- Assumes normal distribution and equal covariance across classes — often unrealistic in real-world data.
- Works poorly if classes are not linearly separable.
- Struggles with imbalanced datasets — may bias towards majority class.
- Limited to `n_classes - 1` components — may not reduce enough dimensions in multi-class problems.
8. Corner Cases / Pitfalls¶
| Case | Problem |
|---|---|
| Non-Gaussian Features | LDA may misrepresent class separability. |
| Highly Imbalanced Classes | Between-class variance may get distorted. |
| Too Few Samples | Leads to unstable covariance matrix (especially in high-dimensional settings). |
| Missing Values | Must be handled before applying LDA. |
| Heteroscedasticity | Unequal variances across classes violate assumptions. |
9. When to Use LDA?¶
| Scenario | Use LDA? | Notes |
|---|---|---|
| You want to reduce dimensions and improve classification | Yes | LDA performs well when assumptions are roughly met. |
| Classes are linearly separable | Yes | LDA finds optimal projection directions. |
| Need 2D visualization of multi-class data | Yes | LDA offers meaningful views of class structure. |
| High dimensional dataset with few classes | Yes | LDA is ideal when n_classes << n_features. |
| Target variable not available | No | LDA cannot be used without y. |
| Non-linear class boundaries | No | Try kernel LDA or t-SNE instead. |
10. Best Practices¶
- Standardize features if they differ in scale.
- Handle missing data before applying LDA.
- Use with classification models like SVM, logistic regression, or Naive Bayes.
- Evaluate LDA assumptions: normality and equal covariances (optional, but recommended).
- Plot explained variance ratio to choose number of discriminants.
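The last point can be checked via the fitted estimator's `explained_variance_ratio_` attribute, sketched here on synthetic 3-class data (which yields at most 2 discriminants):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# Three classes, each shifted along a different axis
X = np.vstack([rng.randn(30, 4) + shift
               for shift in ([0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0])])
y = np.repeat([0, 1, 2], 30)

lda = LinearDiscriminantAnalysis().fit(X, y)
# One ratio per discriminant; with all components kept they sum to 1
print(lda.explained_variance_ratio_)
```

If the first discriminant already explains most of the between-class variance, a 1-D projection may suffice.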
11. Example: LDA for Classification¶
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit LDA (supervised: note that y is passed to fit_transform)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
# Train a classifier on the LDA-transformed features
clf = LogisticRegression()
clf.fit(X_lda, y)
12. Alternative Techniques¶
| Method | Use When |
|---|---|
| PCA | You only want variance capture, not class separation |
| Kernel LDA | Non-linear class separability |
| t-SNE / UMAP | Non-linear data visualization |
| Feature Selection | You want to retain original feature meaning |
# Lets apply LDA on iris dataset from seaborn
import seaborn as sns
iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)
y = iris['species']
# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scaling the features
features = X_train.columns.tolist()
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[features] = sc.fit_transform(X_train[features])
X_test[features] = sc.transform(X_test[features])
# Apply LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)  # n_components can be at most n_classes - 1 (here 3 - 1 = 2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
# Visualize the LDA results
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_train_lda[:, 0], y=X_train_lda[:, 1], hue=y_train, palette='viridis', edgecolor='k', s=100)
plt.xlabel('LDA Component 1')
plt.ylabel('LDA Component 2')
plt.title('LDA Visualization of Iris Dataset')
plt.legend(title='Species')
plt.show()
# The LDA plot shows the distribution of the training data in the new LDA space,
# where each point represents a sample, and the color indicates the species of the iris flower.
# LDA is particularly useful for classification tasks as it maximizes the separation between classes while minimizing the variance within each class.
# Evaluate the LDA model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Fit a classifier (e.g., Random Forest) on the LDA-transformed data
from sklearn.ensemble import RandomForestClassifier
model_lda = RandomForestClassifier(random_state=42)
model_lda.fit(X_train_lda, y_train)
# Predict on the train and test data
y_train_pred_lda = model_lda.predict(X_train_lda)
y_test_pred_lda = model_lda.predict(X_test_lda)
# Evaluate the model - accuracy, classification report, confusion matrix
train_accuracy_lda = accuracy_score(y_train, y_train_pred_lda)
test_accuracy_lda = accuracy_score(y_test, y_test_pred_lda)
train_classification_report_lda = classification_report(y_train, y_train_pred_lda)
test_classification_report_lda = classification_report(y_test, y_test_pred_lda)
train_confusion_matrix_lda = confusion_matrix(y_train, y_train_pred_lda)
test_confusion_matrix_lda = confusion_matrix(y_test, y_test_pred_lda)
print(f"Training Accuracy with LDA: {train_accuracy_lda:.4f}")
print(f"Testing Accuracy with LDA: {test_accuracy_lda:.4f}")
print("Training Classification Report with LDA:\n", train_classification_report_lda)
print("Testing Classification Report with LDA:\n", test_classification_report_lda)
print("Training Confusion Matrix with LDA:\n", train_confusion_matrix_lda)
print("Testing Confusion Matrix with LDA:\n", test_confusion_matrix_lda)
# The accuracy scores, classification reports, and confusion matrices provide insights into the model's performance on both the training and testing datasets.
# The classification report includes precision, recall, and F1-score for each class, while the confusion matrix shows the number of correct and incorrect predictions for each class.
Training Accuracy with LDA: 1.0000
Testing Accuracy with LDA: 1.0000
Training Classification Report with LDA:
precision recall f1-score support
setosa 1.00 1.00 1.00 40
versicolor 1.00 1.00 1.00 41
virginica 1.00 1.00 1.00 39
accuracy 1.00 120
macro avg 1.00 1.00 1.00 120
weighted avg 1.00 1.00 1.00 120
Testing Classification Report with LDA:
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 1.00 1.00 9
virginica 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Training Confusion Matrix with LDA:
[[40 0 0]
[ 0 41 0]
[ 0 0 39]]
Testing Confusion Matrix with LDA:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Dimensionality Reduction Using PCA¶
# Dimensionality Reduction Using PCA
value = pd.read_csv(
filepath_or_buffer="https://raw.githubusercontent.com/insaid2018/pca-file/master/train.csv")
print('Shape of the dataset:', value.shape)
value.head()
Shape of the dataset: (4459, 4993)
| ID | target | 48df886f9 | 0deb4b6a8 | 34b15f335 | a8cb14b00 | 2f0771a37 | 30347e683 | d08d1fbe3 | 6ee66e115 | ... | 3ecc09859 | 9281abeea | 8675bec0b | 3a13ed79a | f677d4d13 | 71b203550 | 137efaa80 | fb36b89d9 | 7e293fbaf | 9fc776466 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000d6aaf2 | 38000000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000fbd867 | 600000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0027d6b71 | 10000000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0028cbf45 | 2000000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 002a68644 | 14400000.0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 4993 columns
Observations:
We are provided with an anonymized dataset.
The dataset contains 4459 observations and 4993 columns.
The target feature is numeric and has an average value of about 5,944,923 units.
It ranges from 300,000 units all the way up to 40,000,000 units.
X = value.drop(labels=['target'], axis=1)
y = value['target']
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20,
random_state=42)
# Display the shape of training and testing data
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
X_train shape:  (3567, 4992)
y_train shape:  (3567,)
X_test shape:  (892, 4992)
y_test shape:  (892,)
# Filter all the columns of dtype float64 and int64
X_train = X_train.select_dtypes(include=['float64', 'int64'])
X_test = X_test.select_dtypes(include=['float64', 'int64'])
# Instantiating a standard scaler object
scaler = StandardScaler()
# Transforming our data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Applying PCA to capture 80% of variance
pca = PCA(n_components=0.80)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Display the shape of PCA transformed data
print('X_train_pca shape: ', X_train_pca.shape)
print('X_test_pca shape: ', X_test_pca.shape)
# Visualizing the explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, alpha=0.7, align='center')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.title('Explained Variance Ratio for PCA with 80% Variance')
plt.xticks(range(1, len(pca.explained_variance_ratio_) + 1))
plt.show()
# Visualizing the cumulative explained variance for PCA with 80% variance using plotly
cumulative_variance_pca = np.cumsum(pca.explained_variance_ratio_)
fig_pca = px.line(x=range(1, len(cumulative_variance_pca) + 1), y=cumulative_variance_pca,
labels={'x': 'Principal Components', 'y': 'Cumulative Explained Variance'},
title='Cumulative Explained Variance for PCA with 80% Variance')
fig_pca.update_layout(xaxis=dict(tickmode='linear', tick0=1, dtick=1))
fig_pca.show()
# Applying Random Forest Regressor on the PCA transformed data with 80% variance
model_pca = RandomForestRegressor(random_state=42)
# Fit the model on the training data with PCA components that capture 80% variance
model_pca.fit(X_train_pca, y_train)
# Predict on the train and test data with PCA components that capture 80% variance
y_train_pred_pca = model_pca.predict(X_train_pca)
y_test_pred_pca = model_pca.predict(X_test_pca)
# Evaluate the model with PCA components that capture 80% variance - r2, RMSE
train_r2_pca = r2_score(y_train, y_train_pred_pca)
test_r2_pca = r2_score(y_test, y_test_pred_pca)
train_rmse_pca = np.sqrt(mean_squared_error(y_train, y_train_pred_pca))
test_rmse_pca = np.sqrt(mean_squared_error(y_test, y_test_pred_pca))
print(f"Training R^2 with PCA components that capture 80% variance: {train_r2_pca:.4f}")
print(f"Testing R^2 with PCA components that capture 80% variance: {test_r2_pca:.4f}")
print(f"Training RMSE with PCA components that capture 80% variance: {train_rmse_pca:.4f}")
print(f"Testing RMSE with PCA components that capture 80% variance: {test_rmse_pca:.4f}")
X_train_pca shape:  (3567, 720)
X_test_pca shape:  (892, 720)
Training R^2 with PCA components that capture 80% variance: 0.8825
Testing R^2 with PCA components that capture 80% variance: 0.0333
Training RMSE with PCA components that capture 80% variance: 2878888.3490
Testing RMSE with PCA components that capture 80% variance: 7404056.9097